Return-Path: bluepeak.westend.com!popeye
Return-Path: <popeye@bluepeak.westend.com>

Received: from popeye.bluepeak.westend.com by bluepeak with smtp 

(Smail3.2 #1) id mOzYbGa-00027DC; Wed, 28 Oct 1998 20:28:24 +0100 (MET) 
Received: from genesis for a.kupries 

with Cubic Circle's cucipop (v1.10 1996/09/06) Wed Oct 28 20:21:08 1998
X-From_: jcw@equi4.com Wed Oct 28 19:00:00 1998 

Received: from vservers.com (root@[207.159.153.130] (maybe forged))
by genesis.westend.com (8.8.6/8.8.6) with ESMTP id SAA06539 
for <a.kupries@westend.com>; Wed, 28 Oct 1998 18:59:57 +0100 (MET) 

Received: from mini.net ([207.159.134.5])

by vservers.com (8.9.0/8.9.0) with ESMTP id JAA13141; 
Wed, 28 Oct 1998 09:57:52 -0800 (PST) 

Received: from equi4.com (siepie.equi4.nl [195.108.246.51]) 
by mini.net (8.8.5/8.8.5) with ESMTP id PAA08975; 
Wed, 28 Oct 1998 15:09:13 GMT 

Message-ID: <363732E1.9C3BDBCE@equi4.com>

Date: Wed, 28 Oct 1998 16:06:09 +0100 

From: Jean-Claude Wippler <jcw@equi4.com>

Organization: Equi4 Software - http://www.equi4.com 

X-Mailer: Mozilla 4.5b2 [en] (Win98; I)
X-Accept-Language: en
MIME-Version: 1.0 

To: Alexandre Ferrieux <alexandre.ferrieux@cnet.francetelecom.fr>,
    Andreas Kupries <a.kupries@westend.com>,
    Brent Welch <welch@scriptics.com>,
    Cameron Laird <claird@Starbase.NeoSoft.COM>,
    Larry Virden <lvirden@cas.org>, Mark Roseman <roseman@teamwave.com>
CC: elmo@onelist.com
Subject: Dataset servers 

Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

All, 

This is a restarted attempt to describe some details of what seem to be 
falling into place, and what I am starting to call "dataset servers". 

The first start I made yesterday ended in so much handwaving (so to 
speak), and so much head-in-the-clouds talk that I abandoned it... 

Another attempt, with about one page of context and justification, then
the details of what has been set up so far and a few immediate plans. 

THE PROBLEM 

To find a way to share lists of information, such as bookmarks, FAQs, 
and the items we've been focusing on: patches, bug lists, several other 
valuable lists incorporated in "Jumbo", and Tcl-URL!. 

Each of these is a dataset, with an unambiguous name, and a way to
identify entries in them. Nothing new here. 

The problems with today's datasets are twofold:

- How to get more people involved to maintain them. Tcl-URL! is a 
recent example: though others are doing the big job of compiling 
content, there's overhead in maintaining the infrastructure, and 
as long as that's manual, it will prevent everyone from being more 
and more productive. Automation is the only way to scale up... 
Content is fun, but the luggage associated with it is not. 

- How to get accurate, painstakingly maintained lists, such as FAQs 
to as many people as possible. Grabbing a copy solves a problem 
now, but creates a maintenance task for the future. And I *want* 
a copy of lists such as bug reports and FAQs - on my disk. Because
there are very powerful ways to browse them (read: "incrFilter").

There's another aspect, which is closely related to both problems. The 
web is full of lists, in HTML. Generated from some database perhaps, 
even dynamically (example: Scriptics), but the data is no longer easily
accessible as a table of records, with a key and modification times. 
That's fine for a few lines of text, just grab a new copy every once in 
a while, but it's very inconvenient for larger amounts of information. 

When someone makes some changes to the manual pages, I'd like to get
those. Not once, not every 6 months, but *frequently*, and with as
little effort as possible. For bugs and patches, I want a *very* up to 
date set to browse in. After all, some packages are actively updated. 
The "RTFM" phrase is a response to a symptom. The underlying problem is 
that documentation is a *big* hassle to track and maintain. Out of date 
documentation is worth so much less, that we all end up not using it. 

TODAY'S SOLUTIONS 

Go to the web. Always. Use the master copy and never anything else. 
Sure, spend a lot of time (waiting), and a lot of money (connecting)...
Sorry, but this is *not* a solution for most lists. Not for me. 

Check on Usenet, c.l.t for example. When in doubt, post a message. 
Again, that's not workable: Usenet is a volatile medium. Everything of 
value fades away within a week or so. The result, a range of frequently 
asked questions (and regulars doing an unbelievably good *manual* job of
replying and pointing people to places where answers can be found).

Check DejaNews, the marriage of Usenet and the Web. Ahem, that's not 
too effective: how do you find an answer, even with a perfect search 
engine, in a database which is growing at the rate of a few *million*
messages a week... (if I may believe the sequence numbers of DejaNews).

Spider a reference web site, and store a local copy. Yes, that's an 
option. It can take some time to do this, and to re-scan for changes.
It also is likely to break down on dynamically generated pages. I've 
tried this a few times - but there is no site which covers exactly my 
interests and not too much more. And refresh really is slow...

Rely on peers, send each other tips and hints through email. Subscribe 
to mailing lists and archive all emails. The trouble with this, is that 
email archives are not really *that* useful when it's a collection of a 
few thousand emails, and nothing but plain text search as access method. 
There is too little focus, there is no organization. And it's a drag to 
carefully place each email in its proper spot (one?). Automated or not. 

DATASET SERVERS 

So, let's try something else. Let's make it easy for people to stay in 
control of the datasets they maintain (and own), while at the same time
adding an infrastructure for others to get a copy, and to track changes. 
On the web, on servers, as databases, as HTML pages, but also as local 
data, available for use while off-line (most of the internet world is 
still very much affected by the distinction of on-line vs. off-line). 

Let's give such datasets a clear identity, and let's create a mechanism 
whereby anyone can find the master copy, even if it moves occasionally. 
Let's do it in such a way that there is no central management / control. 

Let's aim high, and make this approach future-proof. Usable in all
sorts of contexts (including those that have not yet been created).

Let's keep our feet on the ground and start simple, with some very clear 
and immediate goals. And let's keep focus on the benefits right away.

Let's treat organizations and individuals on the same level. Both as
content providers, and as content consumers. Let the content be public, 
and available for use on websites, as well as on people's local disks. 

Let's create "dataset servers".

AND THIS IS HOW 

A dataset is a collection of entries (a table of records). Each dataset
has a unique name. I'll use the Tcl-URL! posting archive as example, 
because it is furthest along in implementation. 

The dataset name is: "tcl/url". It has two locations on the net
associated with it:

    http://purl.org/thecliff/tcl/url.html

and

    http://purl.org/thecliff/tcl/url

This uses OCLC's "Persistent URL" service. OCLC is an independent
organization which sees it as its task to support and enhance archival
services. They have a commitment to be around and stay around. Their
PURL service is a redirection service which re-routes HTTP requests.
It's similar to, but much simpler than, the holy grail of archival: URNs.
More importantly, it works. Today. And it scales well. 

The URL "http://purl.org/thecliff/tcl/url.html" can be entered in a web 
browser, and leads to a page related to the Tcl-URL! archive dataset. 
Its content and meaning is merely: "this is the home of Tcl-URL!".
It can be anywhere - PURLs can be adjusted if it needs to be relocated
any time in the future. PURLs can never go away. The worst that can 
happen, is that they are set to point to a "null URL". Then the page 
shown will be the history of where the PURL *used* to point to. That 
history is, just like the PURL itself, permanent. Available forever. 

The URL "http://purl.org/thecliff/tcl/url" is the core of this system. 
It points to a "dataset server" which must adhere to a fixed protocol. 
Both PURLs can point to any web server, which need not even be the same. 

There are two management issues for PURLs. They can be owned by one or 
more people - allowing them to alter the redirection, and there are PURL 
domain writers, which lets people create new PURLs (again: no deletion).

The domain for this dataset is "http://purl.org/thecliff/tcl/". One or 
more people from the Tcl/Tk community will be given PURL creation access 
eventually, but for now I ask to postpone this until the bigger picture 
becomes clearer and we can agree on guidelines on when to create a new 
PURL and on how to choose names. Extreme restraint is needed, because 
there is no way to get rid of a PURL once defined. Basically, every 
Tcl-related dataset will be given a PURL in the /tcl/ domain. 

PURL ownership is simpler. If Scriptics maintains a "patches" dataset, 
then it is only logical to give Scriptics the tools to decide where this 
dataset will be located. Every PURL can have different owners, though 
it may be a good idea to give a few well-known people and organizations 
in the Tcl community access and let them adjust resources. This is 
likely to happen rarely. Dataset servers tend to stay in one place. 
Though I'm an owner now, future owners can take me off the list later. 

The differences between PURLs and domain names are: 

- ownership/creation of PURLs can be distributed among many 

- changes to a PURL are instant, there is no caching 

- every PURL can redirect to a different site 

PURLs are what make it possible to say "THE dataset name of the Tcl-URL!
archive is: tcl/url". Just add a standard prefix, and you'll be able to 
locate its associated dataset server. Append ".html", and you'll be 
going to a page where all the details of that dataset can be viewed. 
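
To make that naming rule concrete, here is a minimal sketch in Python (chosen only for illustration). The prefix is the one quoted in this message; the function names are hypothetical and not part of any existing tool.

```python
# Sketch: derive both PURLs from a dataset name such as "tcl/url".
# PURL_PREFIX is the prefix used in this message; the function names
# are invented for this example.
PURL_PREFIX = "http://purl.org/thecliff/"


def server_url(dataset_name):
    """URL of the dataset server for a name like 'tcl/url'."""
    return PURL_PREFIX + dataset_name.strip("/")


def info_page_url(dataset_name):
    """URL of the human-readable page describing the dataset."""
    return server_url(dataset_name) + ".html"


if __name__ == "__main__":
    for name in ("tcl/url", "tcl/jumbo", "tcl/patches"):
        print(name, server_url(name), info_page_url(name))
```

Because the dataset name is the only input, a client never has to know where the master copy actually lives.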

For now, we're aiming for the following datasets: 

    tcl/url      Archive of weekly Tcl-URL! posts (JC)
    tcl/jumbo    Alex's resource mix
    tcl/patches  Patches, maintained by Scriptics

Note that the actual *location* of datasets is irrelevant. There are
always master copies (in this first approach), but we don't care where. 

Close your eyes. How many datasets will there be a year from now? 
Answer: it doesn't matter, this can scale and is location independent. 

BECOMING FUTURE-PROOF 

So what *is* a "dataset"? Well, that's something I have been staying 
away from - because I hope to deal with this in a very generic manner. 

A dataset is a collection of entries, each of which can be uniquely 
identified with some string (a record ID, a key field, whatever). There
is no inherent ordering, though keys may imply an obvious/natural one. 

Each entry also has a "modification timestamp" as attribute. 

The content of entries is unspecified. Anything goes. 
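
One way to picture this definition is the following Python sketch. It is only an illustration under assumptions: the class and field names are hypothetical, and the payload is held as an opaque string because the content of entries is deliberately unspecified.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    key: str        # unique within its dataset (record ID, key field, ...)
    modified: str   # modification timestamp, in whatever form the server uses
    content: str    # opaque payload: anything goes


@dataclass
class Dataset:
    name: str                                    # e.g. "tcl/url"
    entries: dict = field(default_factory=dict)  # key -> Entry, no inherent order

    def put(self, entry):
        """Add a new entry, or replace the one with the same key."""
        self.entries[entry.key] = entry

    def remove(self, key):
        """Delete an entry; a missing key is silently ignored."""
        self.entries.pop(key, None)

    def natural_order(self):
        """Keys sorted as strings: the 'obvious' order the keys imply."""
        return sorted(self.entries)


if __name__ == "__main__":
    ds = Dataset("tcl/url")
    ds.put(Entry("1998-10-26", "19981027012345000", "weekly post text"))
    ds.put(Entry("1998-10-19", "19981020123456000", "weekly post text"))
    print(ds.natural_order())
```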

A "dataset server" is what you need to be able to define such datasets 
in the PURL context described above. It is currently defined as a very 
simple HTTP request/reply service. Requests use the CGI approach of 
adding "?..." to the dataset URL to specify the exact request. 

Replies are based on the XML (extensible markup language) notation. 
When you talk to a dataset server, it responds in XML. The protocol of 
these responses is standardized so that all dataset servers act the 
same, though the data itself, and their structure, will vary widely. 
XML is a tree-structured notation which can accommodate all formats.
And the way things are moving now, XML is set to sweep the net, IMO. 

XML is what will make dataset servers and clients "future proof". There 
is a very rudimentary server, written in Tcl, running on tclhttpd, and
serving the Tcl-URL! dataset right now. It's just over 100 lines. Ten 
years ago, it might have been written in C (you tell me how many lines), 
ten years from now it can be written with a tool that doesn't exist yet. 
XML is transport- / system- / tool- / language- / platform-independent. 

Note that XML strictly defines what is exchanged between dataset servers 
and future dataset client(s), but that the storage of the dataset itself
is unspecified. Each one can be different. Today's datasets often are. 

So what does a dataset server *do* exactly? Well, not that much - just 
enough to make the whole system work. A dataset server must be able to 
tell clients what has changed since the last time they checked. It does 
this - in response to an HTTP request - by returning a list of commands 
which the client needs to perform to update itself. 
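
A rough sketch of that idea in Python follows. It is not the Tcl server mentioned above; it assumes the server keeps a chronological change log (a hypothetical structure), that timestamps are ever-increasing integers, and that payloads are plain text. The element names follow the examples given later in this message.

```python
import xml.etree.ElementTree as ET

# Hypothetical change log: (timestamp, command, entry id, payload text).
LOG = [
    (19981020120000000, "add", "Y", "entry Y"),
    (19981020123456000, "add", "Z", "entry Z"),
    (19981025000000000, "add", "X", "entry X"),
    (19981026000000000, "delete", "Y", None),
    (19981027012345000, "replace", "Z", "entry Z, revised"),
]


def build_update(log, since):
    """Return the XML reply for a client that last synced at 'since'."""
    last = max((stamp for stamp, *_ in log), default=0)
    update = ET.Element("update", last=str(last))
    if since == 0:
        # A brand-new client starts from a clean slate.
        ET.SubElement(update, "reset")
    for stamp, command, entry_id, payload in log:
        if stamp > since:
            node = ET.SubElement(update, command,
                                 id=entry_id, modified=str(stamp))
            if payload is not None:
                node.text = payload
    return ET.tostring(update, encoding="unicode")


if __name__ == "__main__":
    print(build_update(LOG, 0))
    print(build_update(LOG, 19981020123456000))
    print(build_update(LOG, 19981027012345000))
```

Replaying the whole log for a new client is the simplest thing that works; a real server would more likely send only the entries that still exist.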

Here's a hypothetical transcript: 

Client: hi, I'm new, what do you have to offer? 
Server: ok, start with a clean slate, then add [these entries]
Server: the last change was on 1998/10/20 12:34:56, by the way 
. . . time passes . . . 

Client: hi again, anything happened since 1998/10/20 12:34:56? 
Server: oh yes, lots: add [this] as X, delete Y, replace Z with [this]
Server: FYI, the last change was on 1998/10/27 01:23:45 
. . . time passes . . . 

Client: what's up, doc? I last spoke to you on 1998/10/27 01:23:45 
Server: it's been a bit quiet, nothing has happened since then 
. . . and so on ... 
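
The wire examples further down write these timestamps as 17-digit strings such as 19981020123456000 and pass them in a "?since=" query. The sketch below shows one way a client might encode them in Python. Reading the last three digits as milliseconds is an assumption, not something this message specifies, and the function names are hypothetical.

```python
from datetime import datetime
from urllib.parse import urlencode


def encode_stamp(moment):
    """Format a datetime as YYYYMMDDhhmmss plus three more digits.

    Assumption: the trailing three digits are milliseconds.
    """
    return moment.strftime("%Y%m%d%H%M%S") + "%03d" % (moment.microsecond // 1000)


def decode_stamp(text):
    """Inverse of encode_stamp, under the same assumption."""
    moment = datetime.strptime(text[:14], "%Y%m%d%H%M%S")
    return moment.replace(microsecond=int(text[14:17]) * 1000)


def request_url(server, since="0"):
    """Build the 'what changed since ...?' request for a dataset server."""
    return server + "?" + urlencode({"since": since})


if __name__ == "__main__":
    stamp = encode_stamp(datetime(1998, 10, 20, 12, 34, 56))
    print(request_url("http://purl.org/thecliff/tcl/url"))
    print(request_url("http://purl.org/thecliff/tcl/url", stamp))
```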

MANAGING THE DATASETS 

A lot of things depend on state changes. Servers hop from one state to 
the next whenever changes are made to their dataset. There are many 
ways in which dataset editing could be implemented. For now, I manually 
apply changes to the tcl/url dataset, since those are always only 
additions. But that will be extended soon. 

One could envision Tk-based utilities which talk to a more sophisticated 
dataset server, identifying themselves, and applying changes from a 
remote location. That could be a single maintainer, or a team of people 
who have agreed to maintain a dataset, or eventually even a way to let
anyone submit/maintain information. We'll come up with many solutions.

A simpler approach, is to let the server generate HTML forms, and to let 
those forms interface with the server, using a CGI process or otherwise.

SERVER PROTOCOL 

We now have independence of just about anything. The one aspect which 
must be clearly defined, is how a server responds to requests. With 
this standard, servers and clients can be implemented. The protocol 
used is new, extensible, and maximally decoupled from content. It is 
also minimalistic. At least, that's what I've tried to accomplish.

In other words: dataset servers can remain simple, and will handle just 
about any type of data we come up with later. 

There will no doubt be more extensions to the protocol in the future. 

Let me translate the above transcript into what goes over the wire: 

Client: fetch http://purl.org/thecliff/tcl/url?since=0

Server: returns the following, as mime type "text/xml":
  <update last="19981020123456000">
    <reset />
    <add id=... modified=...>
    </add>
  </update>

Client: fetch http://purl.org/thecliff/tcl/url?since=19981020123456000

Server: returns the following, as mime type "text/xml":
  <update last="19981027012345000">
    <add id="X" modified=...>
    </add>
    <delete id="Y" modified=... />
    <replace id="Z" modified=...>
    </replace>
  </update>

Client: fetch http://purl.org/thecliff/tcl/url?since=19981027012345000

Server: returns the following, as mime type "text/xml":
  <update last="19981027012345000" />

Note that everything between <add> and </add> tags is the contents of 
that entry, expressed in well-formed XML notation. You can examine the 
real output to see what it can look like. It all resembles HTML, except 
that the tags do not specify formatting but data structure.

Note also that the dataset server doesn't care *what* the data is. It 
is a rocket booster, with a payload (the data) - any payload that fits. 
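
On the client side, applying such a reply could look roughly like the Python sketch below. It assumes only the element names used in the examples above; the sample reply and its <title> payload tag are made up for illustration. The payload is stored untouched, in keeping with the point that the transport does not care what the data is.

```python
import xml.etree.ElementTree as ET


def apply_update(local, xml_text):
    """Apply one <update> reply to 'local', a dict mapping id -> element.

    Returns the 'last' value, to be sent as 'since' on the next request.
    """
    update = ET.fromstring(xml_text)
    for command in update:
        entry_id = command.get("id")
        if command.tag == "reset":
            local.clear()
        elif command.tag in ("add", "replace"):
            local[entry_id] = command   # payload kept as-is, not interpreted
        elif command.tag == "delete":
            local.pop(entry_id, None)
    return update.get("last")


# Made-up reply; <title> is a hypothetical payload tag.
SAMPLE = (
    '<update last="19981027012345000">'
    '<add id="X"><title>new item</title></add>'
    '<delete id="Y" />'
    '<replace id="Z"><title>revised item</title></replace>'
    '</update>'
)

if __name__ == "__main__":
    store = {"Y": None, "Z": None}
    since = apply_update(store, SAMPLE)
    print(since, sorted(store))
```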

And that's all there is to it. 

I will be adding the "modified" attribute to the XML output real soon 
(this change will make some neat change propagation tricks possible).

If anyone wants to experiment with this stuff on the client side, feel 
free to use the tcl/url dataset server. I expect Steve Ball's TclExpat 
and other XML parsing tools to come in quite handy to grab the contents 
and store it on your site. I hope to work on that a few weeks from now. 

I also intend to add a larger dataset server, one which changes a few 
times a day, so that it will be more useful to see change propagation in 
action. If it turns out to be simple, maybe later this week. 

Ok, shoot :)

JC 